小丫微信: epigenomics E-mail: figureya@126.com
作者:小丫
Y叔编辑校验
用GOplot展示clusterProfiler的富集分析结果,画出paper里的⭕️图
出自https://www.nature.com/articles/s41467-017-01195-y
适用于:既想用clusterProfiler做富集分析,又想用GOplot展示结果,但是不知道二者怎样衔接的小伙伴。
此处以GO为例,获得clusterProfiler的富集分析结果,生成GOplot所需的格式,用GOplot画⭕️图。
clusterProfiler除了擅长做GO富集分析以外,还可以用KEGG、Diseaes、Reactome、DAVID、MSigDB等注释库做富集分析。另外,enrichplot自带的gseaplot2函数可以生成GSEA结果的矢量图、多条pathway画到同一个图上,完美代替Java版本的GSEA,看这篇:https://mp.weixin.qq.com/s/AamfRz0BUENCi_1P0P-Ruw。有关clusterProfiler、enrichplot本身的问题,建议加入Y叔知识星球提问。
安装需要的包
#使用国内镜像安装包
options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler", version = "3.8")
BiocManager::install("org.Mm.eg.db", version = "3.8")
install.packages('GOplot')
加载需要用到的包
library(clusterProfiler)
## Warning: package 'clusterProfiler' was built under R version 3.5.1
##
## clusterProfiler v3.10.0 For help: https://guangchuangyu.github.io/software/clusterProfiler
##
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
library(AnnotationHub)
## Warning: package 'AnnotationHub' was built under R version 3.5.1
## Loading required package: BiocGenerics
## Warning: package 'BiocGenerics' was built under R version 3.5.1
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
library(AnnotationDbi)
## Warning: package 'AnnotationDbi' was built under R version 3.5.1
## Loading required package: stats4
## Loading required package: Biobase
## Warning: package 'Biobase' was built under R version 3.5.1
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:AnnotationHub':
##
## cache
## Loading required package: IRanges
## Warning: package 'IRanges' was built under R version 3.5.1
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.5.1
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
library(GOplot)
## Loading required package: ggplot2
## Loading required package: ggdendro
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
## Loading required package: RColorBrewer
library(ggplot2)
Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor
如果你的基因ID已经是ENTREZ ID,并保存成very_easy_input_**.csv的格式,就可以跳过这步,直接进入“富集分析”
根据基因名gene symbol找到相应的ensembl ID。模式生物和非模式生物的转换代码稍有不同,根据自己的物种,选择运行以下两段之一。
此处以例文中的小鼠数据为例,其他18个物种只需更改OrgDb = "org.Mm.eg.db"参数
具体物种对应的R包名字看这页:http://bioconductor.org/packages/release/BiocViews.html#___OrgDb
如果你关心的物种不在这个列表里,请跳过这步,直接进入“非模式生物ENTREZ ID的获取”
gsym.fc <- read.csv("easy_input_Mm.csv", as.is = T)
dim(gsym.fc)
## [1] 201 2
#查看有哪些ID
#keytypes(org.Mm.eg.db)
gsym.id <- bitr(gsym.fc$SYMBOL, #基因名
fromType = "SYMBOL", #从gene symbol
toType = "ENTREZID", #提取ENTREZ ID
OrgDb = "org.Mm.eg.db") #相应物种的包,人是org.Hs.eg.db
## Warning in bitr(gsym.fc$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", :
## 1.49% of input gene IDs are fail to map...
head(gsym.id)
## SYMBOL ENTREZID
## 1 Atrn 11990
## 2 Cacna1e 12290
## 3 Cacna1g 12291
## 4 Capn2 12334
## 5 Cdh11 12552
## 6 Cdh4 12561
#合并基因名、ENTREZID、foldchange
idvec <- gsym.id$ENTREZID
names(idvec) <- gsym.id$SYMBOL
gsym.fc$ENTREZID <- idvec[gsym.fc$SYMBOL]
head(gsym.fc)
## SYMBOL log2fc ENTREZID
## 1 Atrn -0.308724 11990
## 2 Cacna1e -0.352114 12290
## 3 Cacna1g -0.300020 12291
## 4 Capn2 -0.199424 12334
## 5 Cdh11 -0.206564 12552
## 6 Cdh4 -0.500161 12561
#保存到文件
write.csv(gsym.fc[,c(3,2)], "very_easy_input_Mm.csv", quote = F, row.names = F)
模式生物请跳过这段,直接进入“富集分析”。
参考资料:
Y叔公众号biobabble:https://mp.weixin.qq.com/s/lHKZtzpN2k9uPN7e6HjH3w
AnnotationHub包的说明文档:https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html
准备工作:
把基因名跟ENTREZ ID对应关系的注释文件保存到本地文件,以后想要获取同一基因组版本的ENTREZ ID时,直接导入这个“Zmays(物种名).AH66225(版本).sqlite”文件就可以了。此处以玉米为例:
hub <- AnnotationHub() #大概需要2分钟,网速慢就要更久
#查看AnnotationHub里有哪些物种,记住idx列里的AH***
#d <- display(hub)
#或者直接搜zea(玉米拉丁名的一部分)
query(hub, "zea")
#此处下载“AH61838”
maize.db <- hub[['AH66225']] #大概需要3分钟,网速慢就要更久
#查看包含的基因数
length(keys(maize.db))
#查看包含多少种ID
columns(maize.db)
#查看前几个基因的ID长啥样
select(maize.db, keys(maize.db)[1:3],
c("REFSEQ", "SYMBOL"), #你想获取的ID
"ENTREZID")
#保存到文件
saveDb(maize.db, "Zmays.AH66225.sqlite")
根据基因名gene symbol获取ENTREZ ID
#读入差异基因
gsym.fc <- read.csv("easy_input_Zm.csv")
head(gsym.fc)
## SYMBOL log2fc
## 1 pco072676(750) 1.7600181
## 2 crr1 0.1031444
## 3 sqs1 1.3818824
## 4 umc2347 -1.5934269
## 5 LOC541617 -0.9625051
## 6 mpk2 -1.9493218
dim(gsym.fc)
## [1] 1000 2
#读入前面保存到本地的注释文件
maize.db <- loadDb("Zmays.AH66225.sqlite")
gsym.id <- bitr(gsym.fc$SYMBOL, #基因名所在的列
"SYMBOL", #根据gene symbol
"ENTREZID", #提取ENTREZ ID
maize.db) #玉米注释
## 'select()' returned 1:many mapping between keys and columns
head(gsym.id)
## SYMBOL ENTREZID
## 1 pco072676(750) 541612
## 2 crr1 541613
## 3 sqs1 541614
## 4 umc2347 541615
## 5 LOC541617 541617
## 6 mpk2 541618
dim(gsym.id)
## [1] 1006 2
#合并基因名、ENTREZID、foldchange
idvec <- gsym.id$ENTREZID
names(idvec) <- gsym.id$SYMBOL
gsym.fc$ENTREZID <- idvec[gsym.fc$SYMBOL]
head(gsym.fc)
## SYMBOL log2fc ENTREZID
## 1 pco072676(750) 1.7600181 541612
## 2 crr1 0.1031444 541613
## 3 sqs1 1.3818824 541614
## 4 umc2347 -1.5934269 541615
## 5 LOC541617 -0.9625051 541617
## 6 mpk2 -1.9493218 541618
dim(gsym.fc)
## [1] 1000 3
#保存到文件
write.csv(gsym.fc[,c(3,2)], "very_easy_input_Zm.csv", quote = F, row.names = F)
如果你已经做好了富集分析,并且保存成“enrichGO_output.csv”的格式,就可以跳过这部分,直接进入“把clusterProfiler输出的富集分析结果转成GOplot所需的格式”
#ENTREZ ID位于第一列,log2foldchange位于第二列
id.fc <- read.csv("very_easy_input_Mm.csv", as.is = T)
head(id.fc)
## ENTREZID log2fc
## 1 11990 -0.308724
## 2 12290 -0.352114
## 3 12291 -0.300020
## 4 12334 -0.199424
## 5 12552 -0.206564
## 6 12561 -0.500161
dim(id.fc)
## [1] 201 2
ego <- enrichGO(gene = id.fc$ENTREZID,
#小鼠用这行
OrgDb = org.Mm.eg.db,
#人类用这行
#OrgDb = org.Hs.eg.db,
#非模式生物用这行,例如玉米
#OrgDb = maize.db,
ont = "BP", #或MF或CC
pAdjustMethod = "BH",
#pvalueCutoff = 0.001,
qvalueCutoff = 0.01)
dim(ego)
## [1] 169 9
#把富集分析结果保存到文件
write.csv(ego,"enrichGO_output.csv",quote = F)
富集的GO term有些是相似的,可以用语义学方法,合并相似的GO term。需要较长时间。
参考资料:https://guangchuangyu.github.io/2015/10/use-simplify-to-remove-redundancy-of-enriched-go-terms/
#ego2 <- simplify(ego, cutoff = 0.7, by = "p.adjust", select_fun = min)
#dim(ego2)
#write.csv(ego2,"enrichGO_simplify_output.csv",quote = F)
用enrichplot自带的函数画⭕️
参考资料:https://bioconductor.org/packages/release/bioc/vignettes/enrichplot/inst/doc/enrichplot.html
#把ENTREZ ID转为gene symbol
egox <- setReadable(ego, 'org.Mm.eg.db', #物种
'ENTREZID')
#把基因倍数信息转成画图所需的格式
geneList <- id.fc$log2fc
names(geneList) <- id.fc$ENTREZID
#画⭕️图
cnetplot(egox,
foldChange = geneList,
#foldChange = NULL, #不展示倍数
circular = TRUE,
#node_label = FALSE, #如果太多,就不要显示基因名了
showCategory = 4, #显示富集的term数量,默认5
colorEdge = TRUE)
#保存到pdf文件
ggsave("clusterProfiler_circle.pdf", width = 8, height = 5)
#不画成⭕️,process分开,效果更好呢
cnetplot(egox,
foldChange = geneList,
#foldChange = NULL, #不展示倍数
#circular = TRUE,
#node_label = FALSE, #不显示基因名
showCategory = 4, #显示的富集term数量,默认5
colorEdge = TRUE)
ggsave("clusterProfiler_not_circle.pdf", width = 8, height = 5)
#读入富集分析结果
ego <- read.csv("enrichGO_output.csv", header = T)
ego[1,]
## X ID Description GeneRatio BgRatio
## 1 GO:0046777 GO:0046777 protein autophosphorylation 14/194 235/23239
## pvalue p.adjust qvalue
## 1 1.205327e-08 2.289313e-05 1.75051e-05
## geneID
## 1 13869/14254/14255/16001/16337/16542/16589/17096/18750/15467/27103/53416/54635/16590
## Count
## 1 14
go <- data.frame(Category = "BP",
ID = ego$ID,
Term = ego$Description,
Genes = gsub("/", ", ", ego$geneID),
adj_pval = ego$p.adjust)
#基因变化倍数
id.fc <- read.csv("very_easy_input_Mm.csv", as.is = T)
head(id.fc)
## ENTREZID log2fc
## 1 11990 -0.308724
## 2 12290 -0.352114
## 3 12291 -0.300020
## 4 12334 -0.199424
## 5 12552 -0.206564
## 6 12561 -0.500161
genelist <- data.frame(ID = id.fc$ENTREZID, logFC = id.fc$log2fc)
#把富集分析和倍数整合在一起
circ <- circle_dat(go, genelist)
head(circ)
## category ID term count genes logFC
## 1 BP GO:0046777 protein autophosphorylation 14 13869 -0.301868
## 2 BP GO:0046777 protein autophosphorylation 14 14254 -0.183248
## 3 BP GO:0046777 protein autophosphorylation 14 14255 -0.594008
## 4 BP GO:0046777 protein autophosphorylation 14 16001 -0.213881
## 5 BP GO:0046777 protein autophosphorylation 14 16337 -0.182849
## 6 BP GO:0046777 protein autophosphorylation 14 16542 -0.150017
## adj_pval zscore
## 1 2.289313e-05 -1.069045
## 2 2.289313e-05 -1.069045
## 3 2.289313e-05 -1.069045
## 4 2.289313e-05 -1.069045
## 5 2.289313e-05 -1.069045
## 6 2.289313e-05 -1.069045
#根据ENTREZ ID提取gene symbol
id.gsym <- bitr(circ$genes, #基因名
fromType = "ENTREZID", #从ENTREZ ID
toType = "SYMBOL", #提取gene symbol
OrgDb = "org.Mm.eg.db") #相应物种的包
## 'select()' returned 1:1 mapping between keys and columns
#把circ里的ENTREZ ID换成gene symbol
rownames(id.gsym) <- id.gsym$ENTREZID
circ.gsym <- circ
circ.gsym$genes <- id.gsym[circ$genes,]$SYMBOL
head(circ.gsym)
## category ID term count genes logFC
## 1 BP GO:0046777 protein autophosphorylation 14 Erbb4 -0.301868
## 2 BP GO:0046777 protein autophosphorylation 14 Flt1 -0.183248
## 3 BP GO:0046777 protein autophosphorylation 14 Flt3 -0.594008
## 4 BP GO:0046777 protein autophosphorylation 14 Igf1r -0.213881
## 5 BP GO:0046777 protein autophosphorylation 14 Insr -0.182849
## 6 BP GO:0046777 protein autophosphorylation 14 Kdr -0.150017
## adj_pval zscore
## 1 2.289313e-05 -1.069045
## 2 2.289313e-05 -1.069045
## 3 2.289313e-05 -1.069045
## 4 2.289313e-05 -1.069045
## 5 2.289313e-05 -1.069045
## 6 2.289313e-05 -1.069045
#现在就可以用GOplot画图了,例如
#GOBar(subset(circ, category == 'BP'))
#GOBubble(circ, labels = 3)
#GOCircle(circ)
准备画⭕️图所需的数据格式
#参数设置
n = 5 #圈图需要选定term,这里画前面5个
chord <- chord_dat(circ, genelist, go$Term[1:n])
head(chord)
## protein autophosphorylation cell junction assembly
## 11990 0 0
## 12552 0 1
## 12561 0 1
## 12565 0 1
## 13052 0 0
## 13602 0 0
## cell junction organization regulation of developmental growth
## 11990 0 1
## 12552 1 0
## 12561 1 1
## 12565 1 0
## 13052 1 1
## 13602 0 0
## cell-cell adhesion via plasma-membrane adhesion molecules logFC
## 11990 0 -0.308724
## 12552 1 -0.206564
## 12561 1 -0.500161
## 12565 1 -0.255698
## 13052 1 -0.200243
## 13602 1 -0.385248
#根据ENTREZ ID提取gene symbol
id.gsym <- bitr(row.names(chord), #基因名
fromType = "ENTREZID", #从ENTREZ ID
toType = "SYMBOL", #提取gene symbol
OrgDb = "org.Mm.eg.db") #相应物种的包
## 'select()' returned 1:1 mapping between keys and columns
#把chord的列名ENTREZ ID换成gene symbol
rownames(id.gsym) <- id.gsym$ENTREZID
head(id.gsym)
## ENTREZID SYMBOL
## 11990 11990 Atrn
## 12552 12552 Cdh11
## 12561 12561 Cdh4
## 12565 12565 Cdh9
## 13052 13052 Cxadr
## 13602 13602 Sparcl1
chord.gsym <- chord
row.names(chord.gsym) <- id.gsym[row.names(chord),]$SYMBOL
head(chord.gsym)
## protein autophosphorylation cell junction assembly
## Atrn 0 0
## Cdh11 0 1
## Cdh4 0 1
## Cdh9 0 1
## Cxadr 0 0
## Sparcl1 0 0
## cell junction organization regulation of developmental growth
## Atrn 0 1
## Cdh11 1 0
## Cdh4 1 1
## Cdh9 1 0
## Cxadr 1 1
## Sparcl1 0 0
## cell-cell adhesion via plasma-membrane adhesion molecules
## Atrn 0
## Cdh11 1
## Cdh4 1
## Cdh9 1
## Cxadr 1
## Sparcl1 1
## logFC
## Atrn -0.308724
## Cdh11 -0.206564
## Cdh4 -0.500161
## Cdh9 -0.255698
## Cxadr -0.200243
## Sparcl1 -0.385248
用⭕️图展示每个term里的基因及其变化倍数
GOChord(chord.gsym,
space = 0.02, #基因方块间隙
gene.order = 'logFC',
lfc.col = c('darkgoldenrod1', 'black', 'cyan1'), #自定义变化倍数的颜色
gene.space = 0.25, #基因名跟⭕️的相对距离
gene.size = 8, #基因名字体大小
border.size = 0.1, #中间曲线的黑色边的粗细
process.label = 8) #term字体大小
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 5 rows containing missing values (geom_point).
ggsave("GOChord.pdf", width = 12, height = 14)
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 5 rows containing missing values (geom_point).
用⭕️聚类图展示相似变化趋势的基因所在的term
#定义足够多的颜色,后面从这里选颜色
mycol <- c("#223D6C","#D20A13","#FFD121","#088247","#58CDD9","#7A142C","#5D90BA","#431A3D","#91612D","#6E568C","#E0367A","#D8D155","#64495D","#7CC767")
GOCluster(circ.gsym, go$Term[1:n],
clust.by = 'logFC', #用变化倍数聚类
#clust.by = 'term', 用富集的term聚类
lfc.col = c('darkgoldenrod1', 'black', 'cyan1'), #自定义变化倍数的颜色
lfc.space = 0.05, #倍数跟树间的空隙大小
lfc.width = 0.01, #变化倍数的⭕️宽度
term.col = mycol[1:n], #自定义term的颜色
term.space = 0.05, #倍数跟term间的空隙大小
term.width = 0.15) #富集term的⭕️宽度
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 5 rows containing missing values (geom_point).
ggsave("GOCluster.pdf", width = 12, height = 14)
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 5 rows containing missing values (geom_point).
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Mm.eg.db_3.7.0 GOplot_1.0.2 RColorBrewer_1.1-2
## [4] gridExtra_2.3 ggdendro_0.1-20 ggplot2_3.1.0
## [7] AnnotationDbi_1.44.0 IRanges_2.16.0 S4Vectors_0.20.0
## [10] Biobase_2.42.0 AnnotationHub_2.14.1 BiocGenerics_0.28.0
## [13] clusterProfiler_3.10.0
##
## loaded via a namespace (and not attached):
## [1] enrichplot_1.2.0 bit64_0.9-7
## [3] progress_1.2.0 httr_1.3.1
## [5] rprojroot_1.3-2 UpSetR_1.3.3
## [7] tools_3.5.0 backports_1.1.2
## [9] R6_2.3.0 DBI_1.0.0
## [11] lazyeval_0.2.1 colorspace_1.3-2
## [13] withr_2.1.2 tidyselect_0.2.5
## [15] prettyunits_1.0.2 bit_1.1-14
## [17] compiler_3.5.0 xml2_1.2.0
## [19] labeling_0.3 triebeard_0.3.0
## [21] scales_1.0.0 ggridges_0.5.1
## [23] stringr_1.3.1 digest_0.6.18
## [25] rmarkdown_1.10 DOSE_3.8.0
## [27] pkgconfig_2.0.2 htmltools_0.3.6
## [29] rlang_0.3.0.1 rstudioapi_0.8
## [31] RSQLite_2.1.1 shiny_1.2.0
## [33] bindr_0.1.1 gridGraphics_0.3-0
## [35] farver_1.0 jsonlite_1.5
## [37] BiocParallel_1.16.0 GOSemSim_2.8.0
## [39] dplyr_0.7.7 magrittr_1.5
## [41] ggplotify_0.0.3 GO.db_3.7.0
## [43] Matrix_1.2-15 Rcpp_1.0.0
## [45] munsell_0.5.0 viridis_0.5.1
## [47] stringi_1.2.4 yaml_2.2.0
## [49] ggraph_1.0.2 MASS_7.3-51.1
## [51] plyr_1.8.4 qvalue_2.14.0
## [53] grid_3.5.0 blob_1.1.1
## [55] promises_1.0.1 ggrepel_0.8.0
## [57] DO.db_2.9 crayon_1.3.4
## [59] lattice_0.20-38 cowplot_0.9.3
## [61] splines_3.5.0 hms_0.4.2
## [63] knitr_1.20 pillar_1.3.0
## [65] fgsea_1.8.0 igraph_1.2.2
## [67] reshape2_1.4.3 fastmatch_1.1-0
## [69] glue_1.3.0 evaluate_0.12
## [71] data.table_1.11.8 BiocManager_1.30.3
## [73] httpuv_1.4.5 tweenr_1.0.0
## [75] urltools_1.7.1 gtable_0.2.0
## [77] purrr_0.2.5 tidyr_0.8.2
## [79] assertthat_0.2.0 ggforce_0.1.3
## [81] mime_0.6 europepmc_0.3
## [83] xtable_1.8-3 later_0.7.5
## [85] viridisLite_0.3.0 tibble_1.4.2
## [87] rvcheck_0.1.1 memoise_1.1.0
## [89] units_0.6-1 bindrcpp_0.2.2
## [91] interactiveDisplayBase_1.20.0